Exploration of Red Wine Quality by Kyungwon Chun

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

This tidy dataset contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating from 0 (very bad) to 10 (very excellent).

Univariate Plots Section

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

The distribution of fixed acidity is positively skewed. The middle half of the wines have fixed acidity between 7.10 and 9.20.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

The volatile acidity shows a bimodal distribution and positive skewness. The middle half of the wines have volatile acidity between 0.39 and 0.64.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.120   7.680   8.445   8.847   9.740  16.285

Total acidity is composed of fixed and volatile acidity. The distribution of total acidity is positively skewed, with a median of 8.445.
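The derived variable can be sketched as a row-wise sum of the two acidity columns. This is an assumption, since the report does not show the code that created it. The sketch uses only the first ten rows printed in the str() output, and the data frame name `wqr_head` is hypothetical:

```r
# First ten rows of the two acidity columns, copied from the str() output.
wqr_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
)

# Assumed definition: total acidity = fixed acidity + volatile acidity.
wqr_head$total.acidity <- wqr_head$fixed.acidity + wqr_head$volatile.acidity

# Five-number summary plus mean for the ten-row sample only.
summary(wqr_head$total.acidity)
```

Because only ten rows are used, the summary will not match the full-data figures above.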

Residual sugar is concentrated at low values and has a long right tail (positive skew).

Chlorides are likewise concentrated at low values and have a long right tail.
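The long tails of these two variables can be checked from the summary() figures alone. The sketch below is a rough heuristic rather than a formal skewness test: it compares the mean with the median and measures how far the maximum lies above the third quartile in units of the interquartile range. The name `tail_stats` is hypothetical.

```r
# Quartiles, means and maxima copied from the summary() output above.
tail_stats <- data.frame(
  variable = c("residual.sugar", "chlorides"),
  q1       = c(1.9, 0.070),
  median   = c(2.2, 0.079),
  mean     = c(2.539, 0.08747),
  q3       = c(2.6, 0.090),
  max      = c(15.5, 0.611)
)

# A mean above the median is a common sign of right (positive) skew.
tail_stats$mean_above_median <- tail_stats$mean > tail_stats$median

# Distance from the third quartile to the maximum, in units of the IQR.
tail_stats$upper_tail_in_iqr <-
  (tail_stats$max - tail_stats$q3) / (tail_stats$q3 - tail_stats$q1)

print(tail_stats)
```

For both variables the maximum lies well over ten IQRs above the third quartile, which is what a long right tail looks like in summary form.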

The total sulfur dioxide has some high outliers: the maximum is 289, while the third quartile is only 62.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

The middle half of the wines have a density between 0.9956 and 0.9978.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

The middle half of the wines have a pH between 3.210 and 3.400.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

Most of the wines have a quality of 5 or 6.

Univariate Analysis

What is the structure of your dataset?

There are 1,599 red wines in the dataset with 13 features (X, fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, quality). X identifies the wines, and quality represents how good the wine is. X and quality are, respectively, unordered and ordered factor variables, but I treated them as numerical variables for convenience. All other features represent chemical properties of the wine.

Other observations:

  • Wines with quality 5 or 6 are most common.
  • The median wine quality is 6.
  • Most wines have a quality of 5 or better.
  • About 75% of wines have a quality of 6 or worse.
  • The worst and best qualities in the data set are 3 and 8, respectively.

What is/are the main feature(s) of interest in your dataset?

The main feature in the data set is quality. I’d like to determine which features are best for predicting wine quality. I suspect that some combination of the other variables can be used to build a predictive model for it.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

The primary wine characteristics are sweetness, acidity, tannin, alcohol, and body. Residual sugar, fixed and volatile acidity, alcohol, and density determine those characteristics. I expect these variables to be the ones most closely related to wine quality.

Did you create any new variables from existing variables in the dataset?

I created a variable for the total acidity using the volatile and the fixed acids.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Volatile acidity shows a bimodal distribution.

Bivariate Plots Section

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity                 1.00            -0.26        0.67
## volatile.acidity             -0.26             1.00       -0.55
## citric.acid                   0.67            -0.55        1.00
## residual.sugar                0.11             0.00        0.14
## chlorides                     0.09             0.06        0.20
## free.sulfur.dioxide          -0.15            -0.01       -0.06
## total.sulfur.dioxide         -0.11             0.08        0.04
## density                       0.67             0.02        0.36
## pH                           -0.68             0.23       -0.54
## sulphates                     0.18            -0.26        0.31
## alcohol                      -0.06            -0.20        0.11
## quality                       0.12            -0.39        0.23
## total.acidity                 0.99            -0.16        0.63
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                  0.11      0.09               -0.15
## volatile.acidity               0.00      0.06               -0.01
## citric.acid                    0.14      0.20               -0.06
## residual.sugar                 1.00      0.06                0.19
## chlorides                      0.06      1.00                0.01
## free.sulfur.dioxide            0.19      0.01                1.00
## total.sulfur.dioxide           0.20      0.05                0.67
## density                        0.36      0.20               -0.02
## pH                            -0.09     -0.27                0.07
## sulphates                      0.01      0.37                0.05
## alcohol                        0.04     -0.22               -0.07
## quality                        0.01     -0.13               -0.05
## total.acidity                  0.12      0.10               -0.16
##                      total.sulfur.dioxide density    pH sulphates alcohol
## fixed.acidity                       -0.11    0.67 -0.68      0.18   -0.06
## volatile.acidity                     0.08    0.02  0.23     -0.26   -0.20
## citric.acid                          0.04    0.36 -0.54      0.31    0.11
## residual.sugar                       0.20    0.36 -0.09      0.01    0.04
## chlorides                            0.05    0.20 -0.27      0.37   -0.22
## free.sulfur.dioxide                  0.67   -0.02  0.07      0.05   -0.07
## total.sulfur.dioxide                 1.00    0.07 -0.07      0.04   -0.21
## density                              0.07    1.00 -0.34      0.15   -0.50
## pH                                  -0.07   -0.34  1.00     -0.20    0.21
## sulphates                            0.04    0.15 -0.20      1.00    0.09
## alcohol                             -0.21   -0.50  0.21      0.09    1.00
## quality                             -0.19   -0.17 -0.06      0.25    0.48
## total.acidity                       -0.11    0.68 -0.67      0.16   -0.08
##                      quality total.acidity
## fixed.acidity           0.12          0.99
## volatile.acidity       -0.39         -0.16
## citric.acid             0.23          0.63
## residual.sugar          0.01          0.12
## chlorides              -0.13          0.10
## free.sulfur.dioxide    -0.05         -0.16
## total.sulfur.dioxide   -0.19         -0.11
## density                -0.17          0.68
## pH                     -0.06         -0.67
## sulphates               0.25          0.16
## alcohol                 0.48         -0.08
## quality                 1.00          0.09
## total.acidity           0.09          1.00

Fixed acidity has a strong positive correlation with citric acid (0.67), and volatile acidity has a moderate negative correlation with it (-0.55).

The pH has a strong negative correlation with fixed acidity (-0.68) and a moderate negative correlation with citric acid (-0.54), but only a weak positive correlation with volatile acidity (0.23).

Fixed acidity has a strong positive correlation with density (0.67), and alcohol has a moderate negative correlation with it (-0.50).

Most of the variables do not seem to have strong correlations with quality, but alcohol and volatile acidity have moderate positive and negative correlation with quality, respectively.
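To make that ranking explicit, the correlations with quality can be sorted by absolute value. This sketch copies the rounded values from the quality column of the matrix above rather than recomputing them from the data. The names `cor_quality` and `ranked` are hypothetical.

```r
# Correlations with quality, copied from the quality column of the matrix.
cor_quality <- c(
  fixed.acidity = 0.12, volatile.acidity = -0.39, citric.acid = 0.23,
  residual.sugar = 0.01, chlorides = -0.13, free.sulfur.dioxide = -0.05,
  total.sulfur.dioxide = -0.19, density = -0.17, pH = -0.06,
  sulphates = 0.25, alcohol = 0.48, total.acidity = 0.09
)

# Rank by absolute value so that negative correlations are not buried.
ranked <- cor_quality[order(abs(cor_quality), decreasing = TRUE)]
print(ranked)
```

Alcohol and volatile acidity come first, followed by sulphates and citric acid.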

Among the original variables, the strongest correlation in this data set is between fixed acidity and pH (-0.68). High acidity means low pH, and the plot is consistent with this fact.

Citric acid is one of the main components of fixed acidity. Therefore, the two variables have a strong positive correlation.

The fixed acidity has a strong positive correlation with density, too.

Yeast in wine converts citric acid to acetic acid, which makes up most of the volatile acidity. Therefore, volatile acidity and citric acid are inversely related.

The citric acid has moderate negative correlations with volatile acidity and pH.

Alcohol and density also show a moderate negative correlation.

The quality of wine tends to increase as volatile acidity decreases, probably because the main component of volatile acidity is acetic acid, which causes an unpleasant vinegar taste.

## 
## Call:
## lm(formula = quality ~ volatile.acidity, data = wqr)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.79071 -0.54411 -0.00687  0.47350  2.93148 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       6.56575    0.05791  113.39   <2e-16 ***
## volatile.acidity -1.76144    0.10389  -16.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7437 on 1597 degrees of freedom
## Multiple R-squared:  0.1525, Adjusted R-squared:  0.152 
## F-statistic: 287.4 on 1 and 1597 DF,  p-value: < 2.2e-16

Based on the value of R-squared, volatile acidity explains only about 15.2% of the variance in wine quality.
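For a regression with a single predictor, R-squared equals the squared Pearson correlation between the two variables, so the figure can be cross-checked against the correlation matrix. A small sketch using the rounded values reported above (the variable names are hypothetical):

```r
# Rounded correlation between quality and volatile acidity, from the matrix.
r <- -0.39

# Multiple R-squared reported in the lm() summary above.
r_squared_lm <- 0.1525

# The squared correlation should agree up to the rounding of r.
r^2
abs(r^2 - r_squared_lm)
```

The small gap between the two values comes from rounding the correlation to two decimals.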

## 
## Call:
## lm(formula = quality ~ I(sqrt(alcohol)), data = wqr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8551 -0.4087 -0.1711  0.5115  2.5870 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -2.0237     0.3538   -5.72 1.27e-08 ***
## I(sqrt(alcohol))   2.3756     0.1096   21.68  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7101 on 1597 degrees of freedom
## Multiple R-squared:  0.2274, Adjusted R-squared:  0.2269 
## F-statistic: 469.9 on 1 and 1597 DF,  p-value: < 2.2e-16

Based on the value of R-squared, the square root of alcohol explains only about 22.7% of the variance in wine quality.
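To get a feel for what the fitted coefficients imply, the sketch below evaluates the fitted line at the quartiles of alcohol taken from the summary() output. The helper `predict_quality_from_alcohol` is hypothetical and simply re-applies the rounded coefficients; with the fitted model object one would call predict() instead.

```r
# Coefficients copied from the lm(quality ~ I(sqrt(alcohol))) summary above.
b0 <- -2.0237
b1 <-  2.3756

# Hypothetical helper: fitted quality for a given alcohol percentage.
predict_quality_from_alcohol <- function(alcohol) {
  b0 + b1 * sqrt(alcohol)
}

# Quartiles of alcohol, copied from the summary() output.
alcohol_quartiles <- c(q1 = 9.5, median = 10.2, q3 = 11.1)
fitted_quality <- predict_quality_from_alcohol(alcohol_quartiles)
round(fitted_quality, 2)
```

Across the interquartile range of alcohol, the fitted quality stays between 5 and 6, which matches the narrow spread of quality scores in the data.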

Residual sugar determines the sweetness of the wine. Most of the wines maintain a certain level of sweetness.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

The quality correlates with alcohol and volatile acidity.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Citric acid is one of the main components of fixed acidity. As a result, they have a strong positive correlation.

High fixed acidity causes low pH. Therefore, fixed acidity and citric acid correlate negatively with pH.

Wine with more volatile acidity tends to have less citric acid.

Wine with more fixed acidity tends to be denser, whereas wine with more alcohol tends to be less dense.

What was the strongest relationship you found?

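Among the original variables, the strongest correlation is between fixed acidity and pH (-0.68). Fixed acidity is also strongly and positively correlated with citric acid and density (0.67 each). Because citric acid correlates more strongly with quality (0.23) than fixed acidity (0.12) or density (-0.17) does, it may stand in for both and give a better estimate of wine quality.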

Multivariate Plots Section

c(cor(wqr$volatile.acidity, wqr$sulphates), 
  cor(wqr$volatile.acidity, log10(wqr$sulphates)))
## [1] -0.2609867 -0.3005487

Transforming sulphates to log10(sulphates) strengthens the negative correlation between sulphates and volatile acidity (from -0.26 to -0.30).

c(cor(wqr$alcohol, wqr$pH), cor(wqr$alcohol, wqr$pH^7))
## [1] 0.2056325 0.2287039

Transforming pH to pH^7 slightly increases the correlation between pH and alcohol (from 0.21 to 0.23). As shown below, this also slightly improves the fit of our model.

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wqr)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = wqr)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + sulphates, 
##     data = wqr)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     chlorides, data = wqr)
## m5: lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     chlorides + total.sulfur.dioxide, data = wqr)
## m6: lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     chlorides + total.sulfur.dioxide + pH, data = wqr)
## m7: lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     chlorides + total.sulfur.dioxide + pH + citric.acid, data = wqr)
## 
## ==========================================================================================================================
##                              m1            m2            m3            m4            m5            m6            m7       
## --------------------------------------------------------------------------------------------------------------------------
##   (Intercept)               1.875***      3.095***      2.611***      2.777***      3.005***      4.296***      4.613***  
##                            (0.175)       (0.184)       (0.196)       (0.199)       (0.204)       (0.400)       (0.461)    
##   alcohol                   0.361***      0.314***      0.309***      0.292***      0.277***      0.291***      0.295***  
##                            (0.017)       (0.016)       (0.016)       (0.016)       (0.016)       (0.017)       (0.017)    
##   volatile.acidity                       -1.384***     -1.221***     -1.167***     -1.142***     -1.038***     -1.115***  
##                                          (0.095)       (0.097)       (0.097)       (0.097)       (0.100)       (0.115)    
##   sulphates                                             0.679***      0.874***      0.915***      0.889***      0.899***  
##                                                        (0.101)       (0.111)       (0.110)       (0.110)       (0.110)    
##   chlorides                                                          -1.645***     -1.705***     -2.002***     -1.915***  
##                                                                      (0.394)       (0.392)       (0.398)       (0.403)    
##   total.sulfur.dioxide                                                             -0.002***     -0.002***     -0.002***  
##                                                                                    (0.001)       (0.001)       (0.001)    
##   pH                                                                                             -0.435***     -0.525***  
##                                                                                                  (0.116)       (0.133)    
##   citric.acid                                                                                                  -0.167     
##                                                                                                                (0.121)    
## --------------------------------------------------------------------------------------------------------------------------
##   R-squared                 0.227         0.317         0.336         0.343         0.351         0.357         0.358     
##   adj. R-squared            0.226         0.316         0.335         0.341         0.349         0.355         0.355     
##   sigma                     0.710         0.668         0.659         0.655         0.651         0.649         0.649     
##   F                       468.267       370.379       268.912       208.125       172.683       147.427       126.712     
##   p                         0.000         0.000         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood        -1721.057     -1621.814     -1599.384     -1590.682     -1580.383     -1573.351     -1572.389     
##   Deviance                805.870       711.796       692.105       684.612       675.850       669.931       669.126     
##   AIC                    3448.114      3251.628      3208.768      3193.364      3174.767      3162.701      3162.778     
##   BIC                    3464.245      3273.136      3235.654      3225.626      3212.407      3205.719      3211.173     
##   N                      1599          1599          1599          1599          1599          1599          1599         
## ==========================================================================================================================

The first linear model (m6) accounts for 35.7% of the variance in quality. Less significant variables were left out.
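The AIC and BIC rows in the table follow from the log-likelihood row. The sketch below reproduces them for three of the models, counting k as the number of regression coefficients (including the intercept) plus one for the residual variance, which is how R's AIC() counts parameters for lm fits. The data frame name `model_fit` is hypothetical.

```r
n <- 1599  # number of observations, from the N row of the table

# Log-likelihoods copied from the table above.
model_fit <- data.frame(
  model  = c("m1", "m6", "m7"),
  logLik = c(-1721.057, -1573.351, -1572.389),
  k      = c(3, 8, 9)  # coefficients incl. intercept, plus residual variance
)

# AIC = -2 * logLik + 2 * k;  BIC = -2 * logLik + log(n) * k
model_fit$AIC <- -2 * model_fit$logLik + 2 * model_fit$k
model_fit$BIC <- -2 * model_fit$logLik + log(n) * model_fit$k

print(model_fit)
```

Adding citric.acid (m7) leaves both criteria slightly higher than for m6, which is consistent with leaving that term out.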

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wqr)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = wqr)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + I(log10(sulphates)), 
##     data = wqr)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + I(log10(sulphates)) + 
##     chlorides, data = wqr)
## m5: lm(formula = quality ~ alcohol + volatile.acidity + I(log10(sulphates)) + 
##     chlorides + total.sulfur.dioxide, data = wqr)
## m6: lm(formula = quality ~ alcohol + volatile.acidity + I(log10(sulphates)) + 
##     chlorides + total.sulfur.dioxide + I(pH^7), data = wqr)
## m7: lm(formula = quality ~ alcohol + volatile.acidity + I(log10(sulphates)) + 
##     chlorides + total.sulfur.dioxide + I(pH^7) + citric.acid, 
##     data = wqr)
## 
## ==========================================================================================================================
##                              m1            m2            m3            m4            m5            m6            m7       
## --------------------------------------------------------------------------------------------------------------------------
##   (Intercept)               1.875***      3.095***      3.369***      3.742***      3.998***      4.003***      4.099***  
##                            (0.175)       (0.184)       (0.184)       (0.201)       (0.208)       (0.207)       (0.212)    
##   alcohol                   0.361***      0.314***      0.303***      0.285***      0.270***      0.289***      0.295***  
##                            (0.017)       (0.016)       (0.016)       (0.016)       (0.016)       (0.017)       (0.017)    
##   volatile.acidity                       -1.384***     -1.156***     -1.099***     -1.076***     -0.940***     -1.043***  
##                                          (0.095)       (0.097)       (0.098)       (0.097)       (0.101)       (0.114)    
##   I(log10(sulphates))                                   1.477***      1.794***      1.843***      1.849***      1.894***  
##                                                        (0.177)       (0.190)       (0.189)       (0.188)       (0.190)    
##   chlorides                                                          -1.694***     -1.729***     -2.063***     -1.935***  
##                                                                      (0.383)       (0.380)       (0.385)       (0.390)    
##   total.sulfur.dioxide                                                             -0.002***     -0.002***     -0.002***  
##                                                                                    (0.001)       (0.001)       (0.001)    
##   I(pH^7)                                                                                        -0.000***     -0.000***  
##                                                                                                  (0.000)       (0.000)    
##   citric.acid                                                                                                  -0.228     
##                                                                                                                (0.118)    
## --------------------------------------------------------------------------------------------------------------------------
##   R-squared                 0.227         0.317         0.345         0.353         0.361         0.370         0.371     
##   adj. R-squared            0.226         0.316         0.344         0.352         0.359         0.367         0.368     
##   sigma                     0.710         0.668         0.654         0.650         0.646         0.642         0.642     
##   F                       468.267       370.379       280.646       217.837       180.338       155.588       134.130     
##   p                         0.000         0.000         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood        -1721.057     -1621.814     -1587.752     -1577.984     -1568.023     -1557.699     -1555.809     
##   Deviance                805.870       711.796       682.108       673.825       665.482       656.943       655.393     
##   AIC                    3448.114      3251.628      3185.503      3167.967      3150.046      3131.397      3129.619     
##   BIC                    3464.245      3273.136      3212.389      3200.230      3187.686      3174.414      3178.013     
##   N                      1599          1599          1599          1599          1599          1599          1599         
## ==========================================================================================================================

The variables in this linear model can account for 37.0% of the variance in the quality of the wine. By using log10(sulphates) and pH^7, we could improve the result compared to 35.7% without transformation.

## 
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + I(log10(sulphates)) + 
##     chlorides + total.sulfur.dioxide + I(pH^7) + citric.acid, 
##     data = wqr)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.63753 -0.37786 -0.03801  0.44159  1.96876 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.099e+00  2.123e-01  19.308  < 2e-16 ***
## alcohol               2.948e-01  1.708e-02  17.260  < 2e-16 ***
## volatile.acidity     -1.043e+00  1.138e-01  -9.161  < 2e-16 ***
## I(log10(sulphates))   1.894e+00  1.895e-01   9.995  < 2e-16 ***
## chlorides            -1.935e+00  3.905e-01  -4.954 8.04e-07 ***
## total.sulfur.dioxide -2.207e-03  5.023e-04  -4.394 1.18e-05 ***
## I(pH^7)              -6.244e-05  1.265e-05  -4.936 8.83e-07 ***
## citric.acid          -2.281e-01  1.176e-01  -1.940   0.0525 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6418 on 1591 degrees of freedom
## Multiple R-squared:  0.3711, Adjusted R-squared:  0.3684 
## F-statistic: 134.1 on 7 and 1591 DF,  p-value: < 2.2e-16
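As a worked example of how the fitted coefficients combine, the sketch below applies them by hand to the first wine listed in the str() output (observed quality 5). The rounded coefficients are copied from the summary above. The names `wine1` and `predicted` are hypothetical, and with the fitted model object one would use predict() instead.

```r
# First wine in the dataset, values copied from the str() output.
wine1 <- list(
  alcohol = 9.4, volatile.acidity = 0.7, sulphates = 0.56,
  chlorides = 0.076, total.sulfur.dioxide = 34, pH = 3.51,
  citric.acid = 0
)

# Linear predictor built from the rounded coefficients in the summary.
predicted <- with(wine1,
  4.099 +
    0.2948    * alcohol +
    -1.043    * volatile.acidity +
    1.894     * log10(sulphates) +
    -1.935    * chlorides +
    -2.207e-03 * total.sulfur.dioxide +
    -6.244e-05 * pH^7 +
    -0.2281   * citric.acid
)

round(predicted, 2)
```

The prediction lands close to the observed quality of 5 for this wine, as expected for a wine from the middle of the quality range.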


Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Transforming sulphates and pH increases their correlations with other variables. These transformations give a clue for building a better linear model.

Were there any interesting or surprising interactions between features?

High alcohol and low volatile acidity contents seem to produce better wines.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

I created several linear models. Although transforming a couple of variables raised R-squared slightly, the final model is still not satisfactory. One possible reason is that our dataset contains a small number of observations. Furthermore, most of the observations are middle-quality wines, which makes it difficult for the model to predict the edge cases. A supplementary dataset with more edge cases might help to predict wine quality more accurately.


Final Plots and Summary

Plot One

Description One

Alcohol percentage plays a primary role in determining the quality of wines. The higher the alcohol percentage, the better the wine quality tends to be. However, the R-squared value from our linear model shows that alcohol alone explains only about 22.7% of the variance in wine quality. So alcohol is not the only factor responsible for improvements in wine quality.

Plot Two

Description Two

Volatile acidity has a negative relationship with wine quality, though it is weaker than that of alcohol. A likely reason is that the main component of volatile acidity is acetic acid, which causes an unpleasant vinegar taste.

Plot Three


Description Three

We can see that the model fails to predict the good and bad quality wines. This is because most wines in the data set are of ‘average’ quality and there are too few observations in the extreme ranges. The model accounts for only about 37.1% of the variance in quality.


Reflection

The data analyzed in this project contains the chemical properties and quality information for 1,599 red wines. Based on the statistics of the chemical properties of the wines, I tried to establish a model to predict the quality of each red wine.

Some of the chemical properties showed strong correlations with others, and their relationships could be explained chemically. Alcohol and volatile acidity were directly correlated with the quality of the wine, and these characteristics helped to establish a quality prediction model. Transforming sulphates and pH (to log10(sulphates) and pH^7) strengthened their correlations with other variables and raised the R-squared of the model from 35.7% to 37.0%.

The resulting model had limited predictive power because most of the wines in the data had a quality of 5 or 6 and the number of samples for other qualities was not sufficient.